373 research outputs found

    A flat model approach to Ziegler-Natta olefin polymerization catalysts

    Get PDF

    When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

    Full text link
    Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold τ. In contrast to previous work where τ is assumed to be quite close to 1, we focus on recommendation applications where τ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small τ. To the best of our knowledge, there is no practical solution for computing all user pairs with, say τ = 0.2, on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm --- WHIMP --- that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
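    A minimal sketch of the SimHash ingredient mentioned above (Charikar's random hyperplane projections), which WHIMP combines with wedge sampling: two vectors' signature bits agree with probability 1 − angle/π, so the fraction of agreeing bits yields an estimate of their cosine similarity. Function and parameter names here are illustrative, not from the paper.

```python
import numpy as np

def simhash_signatures(vectors, num_bits=64, seed=0):
    """SimHash: one bit per random hyperplane, set by the sign of the projection."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], num_bits))
    return (vectors @ planes) >= 0  # boolean matrix of shape (n, num_bits)

def estimated_cosine(sig_a, sig_b):
    """Estimate cos(angle(a, b)): bits agree with probability 1 - angle/pi."""
    agreement = np.mean(sig_a == sig_b)
    return np.cos(np.pi * (1.0 - agreement))

# Toy check: a vector and a slightly perturbed copy should score near 1.
rng = np.random.default_rng(1)
x = rng.standard_normal((2, 100))
x[1] = x[0] + 0.1 * rng.standard_normal(100)
sigs = simhash_signatures(x, num_bits=256)
print(estimated_cosine(sigs[0], sigs[1]))
```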

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    Full text link
    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n²D) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
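    Of the techniques named above, reservoir sampling is the easiest to isolate: it lets an LSH-style index cap the size of each bucket while keeping a uniform random sample of everything inserted. The sketch below shows classic Algorithm R applied to a single bucket; the class and parameter names are our own illustration, not FLASH's actual API.

```python
import random

class ReservoirBucket:
    """A fixed-capacity hash bucket holding a uniform sample of inserted ids."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of insertions observed

    def add(self, item_id):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item_id)
        else:
            # Algorithm R: keep the new item with probability capacity/seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item_id
```

    With the capacity fixed, insertion stays O(1) and memory per bucket is bounded no matter how skewed the hash distribution is, which is what makes the approach attractive at this scale.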

    Optimal Data-Dependent Hashing for Approximate Near Neighbors

    Full text link
    We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an n-point data set in a d-dimensional space our data structure achieves query time O(d n^{ρ+o(1)}) and space O(n^{1+ρ+o(1)} + dn), where ρ = 1/(2c²−1) for the Euclidean space and approximation c > 1. For the Hamming space, we obtain an exponent of ρ = 1/(2c−1). Our result completes the direction set forth in [AINR14], who gave a proof of concept that data-dependent hashing can outperform classical Locality Sensitive Hashing (LSH). In contrast to [AINR14], the new bound is not only optimal, but in fact improves over the best (optimal) LSH data structures [IM98, AI06] for all approximation factors c > 1. From the technical perspective, we proceed by decomposing an arbitrary dataset into several subsets that are, in a certain sense, pseudo-random. Comment: 36 pages, 5 figures; an extended abstract appeared in the proceedings of the 47th ACM Symposium on Theory of Computing (STOC 2015).
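    To make the improvement concrete, the snippet below (a worked comparison, not from the paper) evaluates the data-dependent Euclidean exponent ρ = 1/(2c²−1) against the best classical LSH exponent ρ = 1/c² [AI06] for a few approximation factors.

```python
# Query-time exponents for c-approximate near neighbor in Euclidean space:
# classical (optimal) LSH achieves rho = 1/c^2, while the data-dependent
# scheme achieves rho = 1/(2c^2 - 1), strictly smaller for every c > 1.
for c in (1.5, 2.0, 3.0, 4.0):
    lsh = 1.0 / c**2
    data_dep = 1.0 / (2.0 * c**2 - 1.0)
    print(f"c = {c}: LSH rho = {lsh:.3f}, data-dependent rho = {data_dep:.3f}")
```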

    Approximate Near Neighbors for General Symmetric Norms

    Full text link
    We show that every symmetric normed space admits an efficient nearest neighbor search data structure with doubly-logarithmic approximation. Specifically, for every n, d = n^{o(1)}, and every d-dimensional symmetric norm ‖·‖, there exists a data structure for poly(log log n)-approximate nearest neighbor search over ‖·‖ for n-point datasets achieving n^{o(1)} query time and n^{1+o(1)} space. The main technical ingredient of the algorithm is a low-distortion embedding of a symmetric norm into a low-dimensional iterated product of top-k norms. We also show that our techniques cannot be extended to general norms. Comment: 27 pages, 1 figure.
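    For reference, the top-k norm used as the embedding target above is simple to state: the sum of the k largest absolute coordinates, interpolating between the ℓ∞ norm (k = 1) and the ℓ1 norm (k = d). A small illustrative sketch:

```python
import numpy as np

def top_k_norm(x, k):
    """Sum of the k largest absolute coordinates of x."""
    return float(np.sort(np.abs(x))[-k:].sum())

x = np.array([3.0, -7.0, 1.0, 5.0])
print(top_k_norm(x, 1))  # 7.0  -> the l_inf norm
print(top_k_norm(x, 2))  # 12.0 -> 7 + 5
print(top_k_norm(x, 4))  # 16.0 -> the l_1 norm
```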

    Mineral prices persistence and the development of a new energy vehicle industry in China: A fractional integration approach

    Get PDF
    In this paper we examine price persistence in a set of minerals critical for the production of new energy vehicles. We implement techniques based on fractional integration, also allowing for non-linearities and structural breaks at unknown points in time. The results show that the series are generally very persistent, with orders of integration equal to or higher than 1 in practically all cases; the only exceptions are cobalt, tin and zinc, and only when breaks are permitted and for a given subsample. These findings are extremely relevant for initiating a discussion about the challenges that the new energy vehicle industry faces in China. China's government has already enacted some relevant initiatives to stabilise prices, but we conclude that additional measures will be necessary considering the high degree of uncertainty of certain supply-demand factors.

    Prof. Luis A. Gil-Alana gratefully acknowledges financial support from the MINEIC-AEI-FEDER PID2020-113691RB-I00 project from ‘Ministerio de Economía, Industria y Competitividad’ (MINEIC), ‘Agencia Estatal de Investigación’ (AEI), Spain, and ‘Fondo Europeo de Desarrollo Regional’ (FEDER). He also acknowledges support from an internal project of the Universidad Francisco de Vitoria, Madrid, Spain.
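    As a rough illustration of what an order of integration near or above 1 means in practice, the sketch below estimates d with the classical Geweke-Porter-Hudak log-periodogram regression. This is a standard textbook estimator, not the specific fractional-integration methodology (with non-linearities and breaks) used in the paper; the names and bandwidth choice are illustrative.

```python
import numpy as np

def gph_estimate(series, bandwidth_power=0.5):
    """Geweke-Porter-Hudak estimate of the fractional integration order d:
    regress the log periodogram on -log(4 sin^2(freq/2)) at low frequencies."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    m = int(n ** bandwidth_power)  # number of low Fourier frequencies used
    freqs = 2.0 * np.pi * np.arange(1, m + 1) / n
    fft = np.fft.fft(x - x.mean())
    periodogram = np.abs(fft[1:m + 1]) ** 2 / (2.0 * np.pi * n)
    regressor = -np.log(4.0 * np.sin(freqs / 2.0) ** 2)
    slope, _ = np.polyfit(regressor, np.log(periodogram), 1)
    return slope  # d >= 1 signals shocks with permanent effects

# Toy check: a random walk is I(1), so the estimate should be near 1.
walk = np.cumsum(np.random.default_rng(0).standard_normal(2048))
print(gph_estimate(walk))
```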

    Persistence and trends in CO2 emissions in Africa: is Chinese FDI behind these features?

    Get PDF
    In this article, we investigate the statistical features of CO2 emissions and CO2 emissions per capita in a group of 45 African countries by looking at their degree of persistence and testing for the existence of trends in the data. In addition, we investigate whether the level of emissions is related to Chinese FDI in Africa. The results are very heterogeneous across countries: we observe orders of integration statistically below 1 in one group of countries; in others, the majority, the values are around 1; while for some others the degree of integration is statistically significantly above 1. Linear time trends are observed in approximately half of the countries. These results imply that, in the long term, public measures to reduce CO2 emissions may be required in the majority of the countries, since in the event of shocks the series will not return by themselves to their original levels. Looking at Chinese FDI in these countries, there seems to be no relationship between Chinese investment in Africa and CO2 emission persistence, though this result needs to be contrasted in future research.

    El rey o el papa. La crisis de lealtades del alto clero español a través de la controversia de 1799 en la Rota de la Nunciatura

    Get PDF
    This text examines the crisis of the framework of double loyalty that, during the Ancien Régime, had bound the upper clergy simultaneously to the Crown and to the Holy See. The imbalance of this shared allegiance is approached at the micro level, taking as an example the controversy that took place in the Rota of the nunciature of Spain as a consequence of the royal decree of 5 September 1799, by virtue of which the bishops and certain royal tribunals would exercise, during the vacancy of the papal see caused by the death of Pius VI, certain faculties reserved to the Holy See. A detailed examination of the prior trajectories of the participants in the controversy explains, to a large extent, the ecclesiological position they upheld.

    This paper examines the crisis in the long-term relationship model between the Spanish upper clergy, the Crown and the Papacy. Throughout the ancien régime, the high-ranking secular clergy divided its loyalty between two sovereign powers without any major problem. But this double loyalty underwent a crisis in the second half of the eighteenth century, aggravated by the French Revolution and the international political context. The controversy that arose between the members of the Spanish Rota tribunal concerning the royal decree of 1799, which ordered bishops and some royal courts to assume functions reserved to the Holy See, shows on a micro level the factors that led people to choose one loyalty over another.

    Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

    Full text link
    The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution H over locality-sensitive hash functions that partition space. For a collection of n points, after preprocessing, the query time is dominated by O(n^ρ log n) evaluations of hash functions from H and O(n^ρ) hash table lookups and distance computations, where ρ ∈ (0,1) is determined by the locality-sensitivity properties of H. It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to O(log² n), leaving the query time to be dominated by O(n^ρ) distance computations and O(n^ρ log n) additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely matches the Indyk-Motwani framework, making it a viable replacement in practice. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to O(n^ρ). Comment: 15 pages, 3 figures.
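    To ground the framework's cost model, here is a minimal index of the kind the abstract describes: L tables, each keyed by K concatenated hash bits (random hyperplanes, suitable for angular distance). Query cost is then dominated by the table lookups and the distance computations over retrieved candidates, matching the O(n^ρ) terms above. Class and parameter names are our illustration, not the paper's.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by K concatenated random-hyperplane bits."""

    def __init__(self, dim, num_tables=10, bits_per_table=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_tables, dim, bits_per_table))
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _keys(self, v):
        bits = np.einsum('d,tdk->tk', v, self.planes) >= 0
        return [tuple(row) for row in bits]

    def insert(self, v):
        self.points.append(v)
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(len(self.points) - 1)

    def query(self, q):
        # Union of colliding buckets; the caller then performs the
        # distance computations over these candidates.
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table[key])
        return candidates
```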

    Deep Discrete Hashing with Self-supervised Pairwise Labels

    Full text link
    Hashing methods have been widely used for large-scale image retrieval and classification. Non-deep hashing methods using handcrafted features have been significantly outperformed by deep hashing methods, owing to the latter's better feature representations and end-to-end learning framework. However, the most striking successes in deep hashing have mostly involved discriminative models, which require labels. In this paper, we propose a novel unsupervised deep hashing method, named Deep Discrete Hashing (DDH), for large-scale image retrieval and classification. In the proposed framework, we address two main problems: 1) how to directly learn discrete binary codes, and 2) how to equip the binary representation with the ability to perform accurate image retrieval and classification in an unsupervised way. We resolve these problems by introducing an intermediate variable and a loss function steering the learning process, which is based on the neighborhood structure in the original space. Experimental results on standard datasets (CIFAR-10, NUS-WIDE, and Oxford-17) demonstrate that our DDH significantly outperforms existing hashing methods by a large margin in terms of mAP for image retrieval and object recognition. Code is available at https://github.com/htconquer/ddh.
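    The retrieval side of any such method reduces to ranking binary codes by Hamming distance. The sketch below shows that generic step with naive sign-thresholded codes standing in for DDH's learned codes; the learning itself (the intermediate variable and neighborhood-structure loss) is the paper's contribution and is not reproduced here.

```python
import numpy as np

def binarize(features):
    """Naive sign thresholding into binary codes (stand-in for learned codes)."""
    return (features >= 0).astype(np.uint8)

def hamming_ranking(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind='stable')

rng = np.random.default_rng(0)
db = binarize(rng.standard_normal((1000, 48)))  # 1000 items, 48-bit codes
q = binarize(rng.standard_normal(48))
print(hamming_ranking(q, db)[:5])  # indices of the 5 nearest codes
```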